Variable selection using random forests
Abstract
Focusing on random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001, this paper investigates two classical issues of variable selection. The first is to find important variables for interpretation; the second is more restrictive and aims to design a good prediction model. The main contribution is twofold: to provide some insights into the behavior of the variable importance index based on random forests, and to propose a strategy that ranks the explanatory variables using the random forests score of importance and then introduces them through a stepwise ascending procedure.
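The two-step strategy named in the abstract (rank the variables by importance, then introduce them one at a time) can be sketched roughly as follows. This is a minimal illustration and not the authors' procedure: the function names, the `min_gain` threshold rule, and the toy scores are all assumptions made for the example. In practice the importance scores and the error estimate would come from a fitted random forest (for instance a permutation importance and an out-of-bag error), which is left out here to keep the sketch self-contained.

```python
from typing import Callable, Dict, List, Sequence


def rank_variables(importance: Dict[str, float]) -> List[str]:
    """Return variable names sorted by decreasing importance score."""
    return sorted(importance, key=lambda name: importance[name], reverse=True)


def stepwise_ascending_selection(
    ranked: Sequence[str],
    error_of: Callable[[List[str]], float],
    min_gain: float = 0.0,
) -> List[str]:
    """Walk through the ranked variables and keep each one only if adding it
    lowers the estimated error by more than `min_gain` (hypothetical rule)."""
    selected: List[str] = []
    best_error = float("inf")
    for name in ranked:
        candidate = selected + [name]
        error = error_of(candidate)
        if best_error - error > min_gain:
            selected = candidate
            best_error = error
    return selected


# Toy stand-ins (assumed values, not results from the paper).
importance = {"x1": 0.50, "x2": 0.30, "noise": 0.01, "x3": 0.10}
error_reduction = {"x1": 0.40, "x2": 0.20, "x3": 0.05, "noise": 0.00}


def toy_error(subset: List[str]) -> float:
    """Pretend error estimate: starts at 1.0 and drops with each useful variable."""
    return 1.0 - sum(error_reduction[name] for name in subset)


ranked = rank_variables(importance)
selected = stepwise_ascending_selection(ranked, toy_error, min_gain=0.1)
print(ranked)
print(selected)
```

With these toy numbers the weakly informative and noise variables are left out, because neither lowers the pretend error by more than the chosen threshold. The threshold is the main design choice: a larger `min_gain` gives a smaller model aimed at prediction, while a value near zero keeps more variables for interpretation.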
Related articles
Airborne Lidar Feature Selection for Urban Classification Using Random Forests
Various multi-echo and Full-waveform (FW) lidar features can be processed. In this paper, multiple classifiers are applied to lidar feature selection for urban scene classification. Random forests are used since they provide an accurate classification and run efficiently on large datasets. Moreover, they return measures of variable importance for each class. The feature selection is obtained by ...
High-Dimensional Variable Selection for Survival Data
The minimal depth of a maximal subtree is a dimensionless order statistic measuring the predictiveness of a variable in a survival tree. We derive the distribution of the minimal depth and use it for high-dimensional variable selection using random survival forests. In big p and small n problems (where p is the dimension and n is the sample size), the distribution of the minimal depth reveals a...
Random Forests: some methodological insights
This paper examines, from an experimental perspective, random forests, the increasingly used statistical method for classification and regression problems introduced by Leo Breiman in 2001. It first aims to confirm known but sparse advice for using random forests and to propose some complementary remarks for both standard problems and high-dimensional ones for which the number of va...
Random Forests-based Feature Selection for Land-use Classification Using Lidar Data and Orthoimagery
The development of lidar systems, especially those incorporating high-resolution camera components, has shown great potential for urban classification. However, automatically selecting the best features for land-use classification is challenging. Random Forests, a newly developed machine learning algorithm, is receiving considerable attention in the field of image classification and pattern re...
Random forest Gini importance favours SNPs with large minor allele frequency: impact, sources and recommendations
The use of random forests is increasingly common in genetic association studies. The variable importance measure (VIM) that is automatically calculated as a by-product of the algorithm is often used to rank polymorphisms with respect to their ability to predict the investigated phenotype. Here, we investigate a characteristic of this methodology that may be considered as an important pitfall, n...
Journal:
- Pattern Recognition Letters
Volume: 31
Publication date: 2010